arXiv: 1503.0255lv2 [stat.ML] 9Jun2015 


Kernel-Based Just-In-Time Learning for 
Passing Expectation Propagation Messages 


Wittawat Jitkrittum , 1 Arthur Gretton , 1 Nicolas Heess,* S. M. Ali Eslami* 
Balaji Lakshminarayanan , 1 Dino Sejdinovic 2 and Zoltan Szabo 1 

Gatsby Unit, University College London 1 
University of Oxford 2 

{wittawatj, arthur.gretton}@gmail.com, nheess@gmail.com 
ali@arkitus.com, balajiggatsby.ucl.ac.uk, 
dino.sejdinovicSgmail.com, zoltan.szabo@gatsby.ucl.ac.uk 


Abstract 

We propose an efficient nonparametric strategy 
for learning a message operator in expectation 
propagation (EP), which takes as input the set 
of incoming messages to a factor node, and pro¬ 
duces an outgoing message as output. This 
learned operator replaces the multivariate inte¬ 
gral required in classical EP, which may not have 
an analytic expression. We use kernel-based re¬ 
gression, which is trained on a set of probabil¬ 
ity distributions representing the incoming mes¬ 
sages, and the associated outgoing messages. 
The kernel approach has two main advantages: 
first, it is fast, as it is implemented using a novel 
two-layer random feature representation of the 
input message distributions; second, it has prin¬ 
cipled uncertainty estimates, and can be cheaply 
updated online, meaning it can request and in¬ 
corporate new training data when it encounters 
inputs on which it is uncertain. In experiments, 
our approach is able to solve learning problems 
where a single message operator is required for 
multiple, substantially different data sets (logis¬ 
tic regression for a variety of classification prob¬ 
lems), where it is essential to accurately assess 
uncertainty and to efficiently and robustly update 
the message operator. 


1 INTRODUCTION 

An increasing priority in Bayesian modelling is to make 
inference accessible and implementable for practitioners, 
without requiring specialist knowledge. This is a goal 


sought, for instance, in probabilistic programming lan¬ 
guages (Wingate et al., 2011; Goodman et al., 2008), as 
well as in more granular, component-based systems (Stan 
Development Team, 2014; Minka et ak, 2014). In all cases, 
the user should be able to freely specify what they wish 
their model to express, without having to deal with the 
complexities of sampling, variational approximation, or 
distribution conjugacy. In reality, however, model con¬ 
venience and simplicity can limit or undermine intended 
models, sometimes in ways the users might not expect. To 
take one example, the inverse gamma prior, which is widely 
used as a convenient conjugate prior for the variance, has 
quite pathological behaviour (Gelman, 2006). In general, 
more expressive, freely chosen models are more likely 
to require expensive sampling or quadrature approaches, 
which can make them challenging to implement or imprac¬ 
tical to run. 

We address the particular setting of expectation propaga¬ 
tion (Minka, 2001), a message passing algorithm wherein 
messages are confined to being members of a particular 
parametric family. The process of integrating incoming 
messages over a factor potential, and projecting the result 
onto the required output family, can be difficult, and in 
some cases not achievable in closed form. Thus, a num¬ 
ber of approaches have been proposed to implement EP 
updates numerically, independent of the details of the fac¬ 
tor potential being used. One approach, due to Barthelme 
and Chopin (2011), is to compute the message update via 
importance sampling. While these estimates converge to 
the desired integrals for a sufficient number of importance 
samples, the sampling procedure must be run at every iter¬ 
ation during inference, hence it is not viable for large-scale 
problems. 

An improvement on this approach is to use importance 
sampled instances of input/output message pairs to train 
a regression algorithm, which can then be used in place 
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of the sampler. Heess et al. (2013) use neural networks 
to learn the mapping from incoming to outgoing messages, 
and the learned mappings perform well on a variety of prac¬ 
tical problems. This approach comes with a disadvantage: 
it requires training data that cover the entire set of possible 
input messages for a given type of problem (e.g., datasets 
representative of all classification problems the user pro¬ 
poses to solve), and it has no way of assessing the uncer¬ 
tainty of its prediction, or of updating the model online in 
the event that a prediction is uncertain. 

The disadvantages of the neural network approach were the 
basis for work by Eslami et al. (2014), who replaced the 
neural networks with random forests. The random forests 
provide uncertainty estimates for each prediction. This al¬ 
lows them to be trained ‘just-in-time’, during EP inference, 
whenever the predictor decides it is uncertain. Uncertainty 
estimation for random forests relies on unproven heuris¬ 
tics, however: we demonstrate empirically that such heuris¬ 
tics can become highly misleading as we move away from 
the initial training data. Moreover, online updating can re¬ 
sult in unbalanced trees, resulting in a cost of prediction of 
0(N ) for training data of size N, rather than the ideal of 

0(log(iV)). 

We propose a novel, kernel-based approach to learning a 
message operator nonparametrically for expectation prop¬ 
agation. The learning algorithm takes the form of a distri¬ 
bution regression problem, where the inputs are probability 
measures represented as embeddings of the distributions to 
a reproducing kernel Hilbert space (RKHS), and the out¬ 
puts are vectors of message parameters (Szabo et al., 2014). 
A first advantage of this approach is that one does not need 
to pre-specify customized features of the distributions, as in 
(Eslami et al., 2014; Heess et al., 2013). Rather, we use a 
general characteristic kernel on input distributions (Christ- 
mann and Steinwart, 2010, eq. 9). To make the algorithm 
computationally tractable, we regress directly in the pri¬ 
mal from random Fourier features of the data (Rahimi and 
Recht, 2007; Le et al., 2013; Yang et al., 2015). In par¬ 
ticular, we establish a novel random feature representation 
for when inputs are distributions, via a two-level random 
feature approach. This gives us both fast prediction (linear 
in the number of random features), and fast online updates 
(quadratic in the number of random features). 

A second advantage of our approach is that, being an in¬ 
stance of Gaussian process regression, there are well es¬ 
tablished estimates of predictive uncertainty (Rasmussen 
and Williams, 2006, Ch. 2). We use these uncertainty es¬ 
timates so as to determine when to query the importance 
sampler for additional input/output pairs, i.e., the uncertain 
predictions trigger just-in-time updates of the regressor. We 
demonstrate empirically that our uncertainty estimates are 
more robust and informative than those for random forests, 
especially as we move away from the training data. 
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Figure 1: Distributions of incoming messages to logistic 
factor in four different UCI datasets. 

Our paper proceeds as follows. In Section 2, we introduce 
the notation for expectation propagation, and indicate how 
an importance sampling procedure can be used as an ora¬ 
cle to provide training data for the message operator. We 
also give a brief overview of previous learning approaches 
to the problem, with a focus on that of Eslami et al. (2014). 
Next, in Section 3, we describe our kernel regression ap¬ 
proach, and the form of an efficient kernel message oper¬ 
ator mapping the input messages (distributions embedded 
in an RKHS) to outgoing messages (sets of parameters of 
the outgoing messages). Finally, in Section 4, we describe 
our experiments, which cover three topics: a benchmark of 
our uncertainty estimates, a demonstration of factor learn¬ 
ing on artificial data with well-controlled statistical prop¬ 
erties, and a logistic regression experiment on four differ¬ 
ent real-world datasets, demonstrating that our just-in-time 
learner can correctly evaluate its uncertainty and update 
the regression function as the incoming messages change 
(see Fig. 1). Code to implement our method is avail¬ 
able online at https : / /github . com/wittawat j / 
kernel-ep. 

2 BACKGROUND 

We assume that distributions (or densities) p over a set of 
variables x = (x \,... Xd) of interest can be represented as 
factor graphs, i.e. p(x) = jy n/=i /j( x ne(/,))- The fac¬ 
tors fj are non-negative functions which are defined over 
subsets x ne (£.) of the full set of variables x. These vari¬ 
ables form the neighbors of the factor node fj in the factor 
graph, and we use ne(/j ) to denote the corresponding set 
of indices. Z is the normalization constant. 

We deal with models in which some of the factors have 
a non-standard form, or may not have a known analytic 
expression (i.e. “black box” factors). Although our ap¬ 
proach applies to any such factor in principle, in this pa¬ 
per we focus on directed factors /(x ou t|x; n ) which spec¬ 
ify a conditional distribution over variables x out given x in 
(and thus ~x. ne (f) = (x ou t,x; n )). The only assumption 
we make is that we are provided with a forward sam¬ 
pling function / : x; n i—»• x out , i.e., a function that maps 
(stochastically or deterministically) a setting of the input 
variables Xi n to a sample from the conditional distribution 









over x out ~ /(-|xi n ). In particular, the ability to evaluate 
the value of /(x out |xi n ) is not assumed. A natural way to 
specify / is as code in a probabilistic program. 


support, and compute importance weights 


™( X ne(/)) 


IIwene(/) m w ^f(x w ) 
r(x in) 


2.1 EXPECTATION PROPAGATION 


Expectation Propagation (EP) is an approximate iterative 
procedure for computing marginal beliefs of variables by 
iteratively passing messages between variables and factors 
until convergence (Minka, 2001). It can be seen as an alter¬ 
native to belief propagation, where the marginals are pro¬ 
jected onto a member of some class of known parametric 
distributions. The message mf^,v(xv) from factor / to 
variable V £ ne(/) is 


proj 


J7( X ne(/)) Iltr'enef/) m V >^f[x V >) dx ne(/) \y 


m v ^f(x v ) 

(1) 

where rnv>-tf are the messages sent to factor / 
from all of its neighboring variables xy, proj [p] = 
argmiiiqg qKL \p\\q\, and Q is typically in the exponential 
family, e.g. the set of Gaussian or Beta distributions. 


Computing the numerator of (1) can be challenging, as it 
requires evaluating a high-dimensional integral as well as 
minimization of the Kullback-Leibler divergence to some 
non-standard distribution. Even for factors with known 
analytic form this often requires hand-crafted approxima¬ 
tions, or the use of expensive numerical integration tech¬ 
niques; for “black-box” factors implemented as forward 
sampling functions, fully nonparametric techniques are 
needed. 


Barthelme and Chopin (2011); Heess et al. (2013); Eslami 
et al. (2014) propose an alternative, stochastic approach to 
the integration and projection step. When the projection is 
to a member q{x\if) = h{x) exp (p T u(x) — A(r))) of an 
exponential family, one simply computes the expectation 
of the sufficient statistic u(-) under the numerator of (1). 
A sample based approximation of this expectation can be 
obtained via Monte Carlo simulation. Given a forward¬ 
sampling function / as described above, one especially 
simple approach is importance sampling. 
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where x 1 ,^ ~ 6, for l = 1,,M and on the left hand 
side. 


6(Xne(/))=/( -X-ne (/)) n mw^f{xw)- 

wene(f) 

On the right hand side we draw samples x 1 ,^ from some 

proposal distribution b which we choose to be 6(x ne (^)) = 
r(xj rl )/(x out |x; n ) for some distribution r with appropriate 


Thus the estimated expected sufficient statistics provide us 
with an estimate of the parameters q of the result q of the 
projection proj [p], from which the message is readily com¬ 
puted. 

2.2 JUST-IN-TIME LEARNING OF MESSAGES 

Message approximations as in the previous section could 
be used directly when running the EP algorithm, as in 
Barthelme and Chopin (2011), but this approach can suf¬ 
fer when the number of samples M is small, and the im¬ 
portance sampling estimate is not reliable. On the other 
hand, for large M the computational cost of running EP 
with approximate messages can be very high, as impor¬ 
tance sampling must be performed for sending each outgo¬ 
ing message. To obtain low-variance message approxima¬ 
tions at lower computational cost, Heess et al. (2013) and 
Eslami et al. (2014) both amortize previously computed ap¬ 
proximate messages by training a function approximator to 
directly map a tuple of incoming variable-to-factor mes¬ 
sages (mv'-yf)v'ene(f) to an approximate factor to vari¬ 
able message i.e. they learn a mapping 

Mf^ v : (rnv'^f)v'ene(f) ^ rrif^v, ( 3 ) 

where 9 are the parameters of the approximator. 

Heess et al. (2013) use neural networks and a large, fixed 
training set to learn their approximate message operator 
prior to running EP. By contrast, Eslami et al. (2014) em¬ 
ploy random forests as their class of learning functions, 
and update their approximate message operator on the fly 
during inference, depending on the predictive uncertainty 
of the current message operator. Specifically, they endow 
their function approximator with an uncertainty estimate 

: (rov'-t/Jv'gnef/) >->• D, (4) 

where 0 indicates the expected unreliability of the pre¬ 
dicted, approximate message returned by M®^ v . 

If 0 = {(m V '-yf) V i£ ne (f)) exceeds a pre¬ 

defined threshold, the required message is approximated 
via importance sampling (cf. (2)) and is up¬ 

dated on this new datapoint (leading to a new set 
of parameters 0' with QJ^y ((mv->f)v'£ne(f))) < 

(K'-t/jv'enef/))' 

Eslami et al. (2014) estimate the predictive uncertainty 
via the heuristic of looking at the variability of the 
forest predictions for each point (Criminisi and Shotton, 
2013). They implement their online updates by splitting the 
trees at their leaves. Both these mechanisms can be prob¬ 
lematic, however. First, the heuristic used in computing un¬ 
certainty has no guarantees: indeed, uncertainty estimation 






for random forests remains a challenging topic of current 
research (Hutter, 2009). This is not merely a theoretical 
consideration: in our experiments in Section 4, we demon¬ 
strate that uncertainty heuristics for random forests become 
unstable and inaccurate as we move away from the initial 
training data. Second, online updates of random forests 
may not work well when the newly observed data are from 
a very different distribution to the initial training sample 
(e.g. Lakshminarayanan et ah, 2014, Fig. 3). For large 
amounts of training set drift, the leaf-splitting approach of 
Eslami et al. can result in a decision tree in the form of 
a long chain, giving a worst case cost of prediction (com¬ 
putational and storage) of O(N) for training data of size 
N, vs the ideal of 0(}og(N)) for balanced trees. Finally, 
note that the approach of Eslami et al. uses certain bespoke 
features of the factors when specifying tree traversal in the 
random forests, notably the value of the factor potentials 
at the mean and mode of the incoming messages. These 
features require expert knowledge of the model on the part 
of the practitioner, and are not available in the “forward 
sampling” setting. The present work does not employ such 
features. 

In terms of computational cost, prediction for the random 
forest of Eslami et al. costs 0{KD r D t \og{N)), 
and updating following a new observation costs 
0(KD^.D t log (TV)), where K is the number of trees 
in the random forest, D t is the number of features used 
in tree traversal, I) r is the number of features used in 
making predictions at the leaves, and N is the number of 
training messages. Representative values are K = 64, 
D t = D r k, 15, and N in the range of 1,000 to 5,000. 

3 KERNEL LEARNING OF OPERATORS 

We now propose a kernel regression method for jointly 
learning the message operator and uncertainty esti¬ 

mate 03® We regress from the tuple of incoming mes¬ 
sages, which are probability distributions, to the parameters 
of the outgoing message. To this end we apply a kernel over 
distributions from (Christmann and Steinwart, 2010) to the 
case where the input consists of more than one distribution. 

We note that Song et al. (2010, 2011) propose a related re¬ 
gression approach for predicting outgoing messages from 
incoming messages, for the purpose of belief propagation. 
Their setting is different from ours, however, as their mes¬ 
sages are smoothed conditional density functions rather 
than parametric distributions of known form. 

To achieve fast predictions and factor updates, we follow 
Rahimi and Recht (2007); Le et al. (2013); Yang et al. 
(2015), and express the kernel regression in terms of ran¬ 
dom features whose expected inner product is equal to the 
kernel function; i.e. we perform regression directly in the 
primal on these random features. In Section 3.1, we de¬ 


fine our kernel on tuples of distributions, and then derive 
the corresponding random feature representation in Sec¬ 
tion 3.2. Section 3.3 describes the regression algorithm, as 
well as our strategy for uncertainty evaluation and online 
updates. 


3.1 KERNELS ON TUPLES OF DISTRIBUTIONS 


In the following, we consider only a single factor, and 
therefore drop the factor identity from our notation. We 
write the set of c incoming messages to a factor node as a 
tuple of probability distributions R := (r^)j : =1 of random 
variables on respective domains X <r> . Our goal is to 
define a kernel between one such tuple, and a second one, 
which we will write S := (s^)j : =1 . 

We define our kernel in terms of embeddings of the tuples 
R, S into a reproducing kernel Hilbert space (RKHS). We 
first consider the embedding of a single distribution in the 
tuple: Let us define an RKHS on each domain, with 
respective kernel fcW [x’-p , x^). We may embed individual 
probability distributions to these RKHSs, following (Smola 
et al., 2007). The mean embedding of r W is written 

AM ! >(') : = J k^ l \x {l \-)dr (l \x (l) ). (5) 

Similarly, a mean embedding may be defined on the prod¬ 
uct of messages in a tuple r = xjL l A l ' > as 

Mr := J k([x (1 \...,x {c '>],-)dr(x (1 \...,x ic) ), (6) 


where we have defined the joint kernel k on the product 
space AfW x ■ • • x X( c \ Finally, a kernel on two such em¬ 
beddings fi r , fi s of tuples R, S can be obtained as in Christ¬ 
mann and Steinwart (2010, eq. 9), 


«(r, s) = exp 




(7) 


This kernel has two parameters: r y 2 , and the width parame¬ 
ter of the kernel k defining /.t r = E x ^ r fc(ir, •). 

We have considered several alternative kernels on tuples 
of messages, including kernels on the message parameters, 
kernels on a tensor feature space of the distribution em¬ 
beddings in the tuple, and dot products of the features (6). 
We have found these alternatives to have worse empirical 
performance than the approach described above. We give 
details of these experiments in Section C of the supplemen¬ 
tary material. 


3.2 RANDOM FEATURE APPROXIMATIONS 

One approach to learning the mapping M^ v from incom¬ 
ing to outgoing messages would be to employ Gaussian 
process regression, using the kernel (7). This approach is 



not suited to just-in-time (JIT) learning, however, as both 
prediction and storage costs grow with the size of the train¬ 
ing set; thus, inference on even moderately sized datasets 
rapidly becomes computationally prohibitive. Instead, we 
define a finite-dimensional random feature map ij) £ R 0 "" 1 
such that k(t,s) ~ and regress directly on 

these feature maps in the primal (see next section): stor¬ 
age and computation are then a function of the dimension 
of the feature map D out , yet performance is close to that 
obtained using a kernel. 

In Rahimi and Recht (2007), a method based on Fourier 
transforms was proposed for computing a vector of ran¬ 
dom features (p for a translation invariant kernel k(x. y) = 
k(x — y) such that k(x , y) ~ ip(x) T ip(y) where x, y £ 
and i p(x),kp(y) £ M Al1 . This is possible because of 
Bochner’s theorem (Rudin, 2013), which states that a con¬ 
tinuous, translation-invariant kernel k can be written in the 
form of an inverse Fourier transform: 



where j = \/ -1 and the Fourier transform k of the kernel 
can be treated as a distribution. The inverse Fourier trans¬ 
form can thus be seen as an expectation of the complex ex¬ 
ponential, which can be approximated with a Monte Carlo 
average by drawing random frequencies from the Fourier 
transform. We will follow a similar approach, and derive a 
two-stage set of random Fourier features for (7). 

We start by expanding the exponent of (7) as 

6XP ^ r,/A> + ^2 - ^2 • 

Assume that the embedding kernel k used to define the 
embeddings // r and // s is translation invariant. Since 
(/z r ,/n s ) = E x ^ r E y ^ s fc(a: — y), one can use the result of 
Rahimi and Recht (2007) to write 

(fin Ms) ~ E x ^ r E y ~ s ip(x) T (p(y) 

= E x ^ r <£(a;) T E v ~ s 0(y) := ^(r) T 0(s), 


where the mappings (j) are l)\ u standard Rahimi-Recht ran¬ 
dom features, shown in Steps 1-3 of Algorithm 1. 

With the approximation of (/z r , fi s ), we have 


t(r, s) « exp — 


U(r) ~0(s)|| 2 a , 

2 7 2 


( 8 ) 


which is a standard Gaussian kernel on R W|n . We can thus 
further approximate this Gaussian kernel by the random 
Fourier features of Rahimi and Recht, to obtain a vector 
of random features i/j such that k( r, s) « if>(r) T -0(s) where 
ip(r),ip(s) £ R Dout . Pseudocode for generating the ran¬ 
dom features if; is given in Algorithm 1. Note that the sine 


Algorithm 1 Construction of two-stage random features 
for k 

Input: Input distribution r, Fourier transform k of the em¬ 
bedding translation-invariant kernel k, number of in¬ 
ner features D m , number of outer features D aut , outer 
Gaussian width 7 2 . 

Output: Random features tp(r) £ R Dout . 

1: Sample k. 

2: Sample ^ UniformfO, 2it\. 

3: 4>{r) = (E^ r cos (wjx + h))^ £ R A ” 

If r(x) = Af(x; m, £), 


: + hi) exp I — -ujJ E to, 


D in 

i=1 


4: Sample {i/i} A"* fc gau ss( 7 2 ) i.e.. Fourier transform of a 
Gaussian kernel with width 7 2 . 

5: Sample Uniform[0, 2n]. 

6: = ^5Iff ( c °s(^ T 0(r) + c i ))._ i G R Dout 


component in the complex exponential vanishes due to the 
translation invariance property (analogous to an even func¬ 
tion), i.e., only the cosine term remains. We refer to Sec¬ 
tion B.3 in the supplementary material for more details. 

For the implementation, we need to pre-compute 

l > and {gISi*’ where An and 

Dout are the number of random features used. A more ef¬ 
ficient way to support a large number of random features is 
to store only the random seed used to generate the features, 
and to generate the coefficients on-the-fly as needed (Dai 
et al., 2014). In our implementation, we use a Gaussian 
kernel for /,:. 


3.3 REGRESSION FOR OPERATOR 
PREDICTION 

Let X = (xi| • ■ ■ |xjv) be the N training samples of 
incoming messages to a factor node, and let Y = 

( E *v~?}^ V r“( a; v)l • • • \ E x v ~ q y^A x vi) € R D v xN be 

the expected sufficient statistics of the corresponding out¬ 
put messages, where q l f_yy is the numerator of (1). We 
write Xj = 'tp(jt) as a more compact notation for the ran¬ 
dom feature vector representing the i th training tuple of 
incoming messages, as computed via Algorithm 1. 

Since we require uncertainty estimates on our predictions, 
we perform Bayesian linear regression from the random 
features to the output messages, which yields predictions 
close to those obtained by Gaussian process regression with 
the kernel in (7). The uncertainty estimate in this case will 










be the predictive variance. We assume prior and likelihood 

w ~ Af (w,0,I Dont a%) , (9) 

Y | X,w~ AT (Y;w T X, a%I N ) , (10) 

where the output noise variance a 2 captures the intrinsic 
stochasticity of the importance sampler used to generate Y. 
It follows that the posterior of w is given by (Bishop, 2006) 

p(w\Y) = AT{w-,p w ,Y, w ), (11) 

£«,= (XX T a- 2 + ( r 0 - 2 /) _1 , (12) 

Hw = E U) XY t ( t- 2 . (13) 

The predictive distribution on the output y* given an obser¬ 
vation x* is 

p(y*|x*, Y) = J p(y*\w,x*,Y)p(w\Y)dw (14) 

= Af(y*;x* T p w ,x* T i: w x*+a 2 y ). (15) 

For simplicity, we treat each output (expected sufficient 
statistic) as a separate regression problem. Treating all out¬ 
puts jointly can be achieved with a multi-output kernel (Al¬ 
varez et ah, 2011). 


Online Update We describe an online update for T, w and 
p w when observations (i.e., random features representing 
incoming messages) arrive sequentially. We use to 
denote a quantity constructed from N samples. Recall that 
= XX T er” 2 + <Jq 2 I. The posterior covariance 
matrix at time N + 1 is 


y( N + 1) = vW 


V (N) T „2 


1 I I vw 

1 T w X]\r + l(Ty 


-2 


06 ) 


meaning it can be expressed as an inexpensive update of the 
covariance at time N. Updating X„, for all the D y outputs 
costs 0((Di n D out + -Dout )D y ) per new observation. For 
fi w = £„,XY T er“ 2 , we maintain XY T £ R 0 ™* x D y , and 
update it at cost 0(Di n D out D y ) as 

(XY t ) (JV+1) = (XY t + X7V+1 y^ +1 ) . (17) 

Since we have D y regression functions, for each tuple of 
incoming messages x*, there are D y predictive variances, 
, one for each output. Let be pre¬ 

specified predictive variance thresholds. Given a new input 
x*, if Vi > T| or ■ ■ ■ or v* D > td v (the operator is uncer¬ 
tain), a query is made to the oracle to obtain a ground truth 
y*. The pair (x*, y*) is then used to update T, w and p w . 


4 EXPERIMENTS 

We evaluate our learned message operator using two differ¬ 
ent factors: the logistic factor, and the compound gamma 


factor. In the first and second experiment we demonstrate 
that the proposed operator is capable of learning high- 
quality mappings from incoming to outgoing messages, 
and that the associated uncertainty estimates are reliable. 
The third and fourth experiments assess the performance 
of the operator as part of the full EP inference loop in two 
different models: approximating the logistic, and the com¬ 
pound gamma factor. Our final experiment demonstrates 
the ability of our learning process to reliably and quickly 
adapt to large shifts in the message distribution, as encoun¬ 
tered during inference in a sequence of several real-world 
regression problems. 

For all experiments we used Infer.NET (Minka et al., 2014) 
with its extensible factor interface for our own operator. We 
used the default settings of Infer.NET unless stated other¬ 
wise. The regression target is the marginal belief (numer¬ 
ator of (1)) in experiment 1,2,3 and 5. We set the regres¬ 
sion target to the outgoing message in experiment 4. Given 
a marginal belief, the outgoing message can be calculated 
straightforwardly. 



Figure 2: Factor graph for binary logistic regression. 
The kernel-based message operator learns to approximate 
the logistic factor highlighted in red. The two incom¬ 
ing messages are m Zi ^f = Af (zp, p,, a 2 ) and m Pi ^.f = 
Beta(py a, /3). 


Experiment 1: Batch Learning As in (Heess et ah, 
2013; Eslami et ah, 2014), we study the logistic factor 
f(p\z) = 6 (p- 1+eX p ( _ z) ) , where 5 is the Dirac delta 
function, in the context of a binary logistic regression 
model (Fig. 2). The factor is deterministic and there are 
two incoming messages: m Pi ^f = Beta(py a, /3) and 
= Af (zp, /i, cr 2 ), where Zi = w T Xi represents the 
dot product between an observation Xi £ R d and the coef¬ 
ficient vector w whose posterior is to be inferred. 

In this first experiment we simply learn a kernel-based 
operator to send the message Following Eslami 

et ah (2014), we set d to 20, and generated 20 different 
datasets, sampling a different w ~ Af(0, 1 ) and then a 
set of {(xi, yi)}™ =1 (n = 300) observations according to 
the model. For each dataset we ran EP for 10 iterations, 
and collected incoming-outgoing message pairs in the first 
five iterations of EP from Infer.NET’s implementation of 
the logistic factor. We partitioned the messages randomly 
into 5,000 training and 3,000 test messages, and learned 
a message operator to predict as described in Sec¬ 

tion 3. Regularization and kernel parameters were chosen 








by leave-one-out cross validation. We set the number of 
random features to Di n = 500 and D out = 1,000; em¬ 
pirically, we observed no significant improvements beyond 
1,000 random features. 
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(a) KL errors 


(b) Examples of predictions 


Figure 3: Prediction errors for predicting the projected be¬ 
liefs to Zi, and examples of predicted messages at different 
error levels. 




(a) Parameters of (b) Uncertainty estimates 

Figure 4: (a) Incoming messages from z to f from 20 EP 
runs of binary logistic regression, as shown in Fig. 2. (b) 
Uncertainty estimates of the proposed kernel-based method 
(predictive variance) and Eslami et al.’s random forests 
(KL-based agreement of predictions of different trees) on 
the two uncertainty test sets shown. For testing, we fix the 
other incoming message m Pi ^f to Beta (pp, 1, 2). 


We report log KL[g/_ >Zi ||g^_> Zi ] where q/^ Zi is the 
ground truth projected belief (numerator of (1)) and qf^ Zi 
is the prediction. The histogram of the log KL errors is 
shown in Fig. 3a; Fig. 3b shows examples of predicted 
messages for different log KL errors. It is evident that the 
kernel-based operator does well in capturing the relation¬ 
ship between incoming and outgoing messages. The dis¬ 
crepancy with respect to the ground truth is barely visible 
even at the 99th percentile. See Section C in the supple¬ 
mentary material for a comparison with other methods. 

Experiment 2: Uncertainty Estimates For the approx¬ 
imate message operator to perform well in a JIT learning 
setting, it is crucial to have reliable estimates of opera¬ 
tor’s predictive uncertainty in different parts of the space 
of incoming messages. To assess this property we com¬ 
pute the predictive variance using the same learned oper¬ 
ator as used in Fig. 3. The forward incoming messages 
m Zi ^f in the previously used training set are shown in 
Fig. 4a. The backward incoming messages m Pi ^f are not 
displayed. Shown in the same plot are two curves (a blue 
line, and a pink parabola) representing two “uncertainty 
test sets”: these are the sets of parameter pairs on which 
we wish to evaluate the uncertainty of the predictor, and 
pass through regions with both high and low densities of 
training samples. Fig. 4b shows uncertainty estimates of 
our kernel-based operator and of random forests, where we 
fix := Beta(pi;l,2) for testing. The implemen¬ 

tation of the random forests closely follows Eslami et al. 
(2014). 

From the figure, as the mean of the test message moves 
away from the region densely sampled by the training 
data, the predictive variance reported by the kernel method 
increases much more smoothly than that of the random 
forests. Further, our method clearly exhibits a higher un¬ 
certainty on the test set #1 than on the test set #2. This 


behaviour is desirable, as most of the points in test set #1 
are either in a low density region or an unexplored region. 
These results suggest that the predictive variance is a ro¬ 
bust criterion for querying the importance sampling ora¬ 
cle. One key observation is that the uncertainty estimates 
of the random forests are highly non-smooth; i.e., uncer¬ 
tainty on nearby points may vary wildly. As a result, a 
random forest-based JIT learner may still query the impor¬ 
tance sampler oracle when presented with incoming mes¬ 
sages similar to those in the training set, thereby wasting 
computation. 

We have further checked that the predictive uncertainty of 
the regression function is a reliable indication of the error in 
KL divergence of the predicted outgoing messages. These 
results are given in Figure 10 of Appendix C. 

Experiment 3: Just-In-Time Learning In this experi¬ 
ment we test the approximate operator in the logistic re¬ 
gression model as part of the full EP inference loop in 
a just-in-time learning setting (KJIT). We now leam two 
kernel-based message operators, one for each outgoing di¬ 
rection from the logistic factor. The data generation is the 
same as in the batch learning experiment. We sequentially 
presented the operator with 30 related problems, where a 
new set of observations {(xj, yi)}f— i was generated at the 
beginning of each problem from the model, while keeping 
w fixed. This scenario is common in practice: one is of¬ 
ten given several sets of observations which share the same 
model parameter (Eslami et al., 2014). As before, the infer¬ 
ence target was p(w\{(xi, 2/i)}" = i)- We set the maximum 
number of EP iterations to 10 in each problem. 

We employed a “mini-batch” learning approach in which 
the operator always consults the oracle in the first few hun¬ 
dred factor invocations for initial batch training. In princi¬ 
ple, during the initial batch training, the operator can per¬ 
form cross validation or type-II maximum likelihood esti- 
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Figure 5: Uncertainty estimate of KJIT in its prediction of outgoing messages at each factor invocation, for the binary 
logistic regression problem. The black dashed lines indicate the start of a new inference problem. 


mation for parameter selection; however for computational 
simplicity we set the kernel parameters according to the 
median heuristic (Scholkopf and Smola, 2002). Full detail 
of the heuristic is given in Section A in the supplementary 
material. The numbers of random features were D in = 300 
and D out = 500. The output noise variance rjy was fixed 
to 10 -4 and the uncertainty threshold on the log predictive 
variance was set to -8.5. To simulate a black-box setup, we 
used an importance sampler as the oracle rather than In- 
fer.NET’s factor implementation, where the proposal dis¬ 
tribution was fixed to Af(z; 0,200) with 5 x 10 5 particles. 

Fig. 5 shows a trace of the predictive variance of KJIT in 
predicting the mean of each rrif^ Zi upon each factor in¬ 
vocation. The black dashed lines indicate the start of a 
new inference problem. Since the first 300 factor invoca¬ 
tions are for the initial training, no uncertainty estimate is 
shown. From the trace, we observe that the uncertainty 
rapidly drops down to a stable point at roughly -8.8 and 
levels off after the operator sees about 1,000 incoming¬ 
outgoing message pairs, which is relatively low compared 
to approximately 3,000 message passings (i.e., 10 itera¬ 
tions x 300 observations) required for one problem. The 
uncertainty trace displays a periodic structure, repeating it¬ 
self in every 300 factor invocations, corresponding to a full 
sweep over all 300 observations to collect incoming mes¬ 
sages rn Z/ ^ f. The abrupt drop in uncertainty in the first 
EP iteration of each new problem is due to the fact that In- 
fer.NET’s inference engine initializes the message from w 
to have zero mean, leading to m Zi ^f also having a zero 
mean. Repeated encounters of such a zero mean incom¬ 
ing message reinforce the operator’s confidence; hence the 
drop in uncertainty. 

Fig. 6a shows binary classification errors obtained by using 
the inferred posterior mean of in on a test set of size 10000 
generated from the true underlying parameter. Included in 
the plot are the errors obtained by using only the impor¬ 
tance sampler for inference (“Sampling”), and using the 
Infer.NET’s hand-crafted logistic factor. The loss of KJIT 
matches well with that of the importance sampler and In- 
fer.NET, suggesting that the inference accuracy is as good 
as these alternatives. Fig. 6b shows the inference time re¬ 
quired by all methods in each problem. While the inference 
quality is equally good, KJIT is orders of magnitude faster 



(a) Binary classification error (b) Inference time 


Figure 6: Classification performance and inference times 
of all methods in the binary logistic regression problem. 


than the importance sampler. 
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Figure 7: Shape (a) and rate (b) parameters of the inferred 
posteriors in the compound gamma problem, (c) KJIT is 
able to infer equally good posterior parameters compared 
to Infer.NET, while requiring a runtime several orders of 
magnitude lower. 


Experiment 4: Compound Gamma Factor We next 
simulate the compound gamma factor, a heavy-tailed prior 
distribution on the precision of a Gaussian random vari¬ 
able. A variable r is said to follow the compound gamma 
distribution if r ~ Gamma(r; S 2 ,r 2 ) (shape-rate parame¬ 
terization) and T 2 ~ Gamma(r 2 ; Si, ri) where (si,ri,S 2 ) 
are parameters. The task we consider is to infer the pos¬ 
terior of the precision t of a normally distributed variable 
x ~ Af(x;0,r) given realizations {aWe consider 
the setting (si, ri, S 2 ) = (1,1,1) which was used in Heess 
et al. (2013). Infer.NET’s implementation requires two 
gamma factors to specify the compound gamma. Here, we 
collapse them into one factor and let the operator learn to 
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Figure 8: Uncertainty estimate of KJIT for outgoing messages on the four UCI datasets. 


directly send an outgoing message given us¬ 

ing Infer.NET as the oracle. The default implementation 
of Infer.NET relies on a quadrature method. As in Eslami 
et al. (2014), we sequentially presented a number of prob¬ 
lems to our algorithm, where at the beginning of each prob¬ 
lem, a random number of observations n from 10 to 100, 
and the parameter r, were drawn from the model. 

Fig. 7a and Fig. 7b summarize the inferred posterior param¬ 
eters obtained from running only Infer.NET and Infer.NET 
+ KJIT, i.e., KJIT with Infer.NET as the oracle. Fig. 7c 
shows the inference time of both methods. The plots collec¬ 
tively show that KJIT can deliver posteriors in good agree¬ 
ment with those obtained from Infer.NET, at a much lower 
cost. Note that in this task only one message is passed to 
the factor in each problem. Fig. 7c also indicates that KJIT 
requires fewer oracle consultations as more problems are 
seen. 


Experiment 5: Classification Benchmarks In the fi¬ 
nal experiment, we demonstrate that our method for learn¬ 
ing the message operator is able to detect changes in the 
distribution of incoming messages via its uncertainty es¬ 
timate, and to subsequently update its prediction through 
additional oracle queries. The different distributions of in¬ 
coming messages are achieved by presenting a sequence of 
different classification problems to our learner. We used 
four binary classification datasets from the UCI repository 
(Lichman, 2013): banknote authentication, blood transfu¬ 
sion, fertility and ionosphere, in the same binary logistic 
regression setting as before. The operator was required to 
learn just-in-time to send outgoing messages ro/_>. Zi and 
on the four problems presented in sequence. The 
training observations consisted of 200 data points subsam¬ 
pled from each dataset by stratified sampling. For the fer¬ 
tility dataset, which contains only 100 data points, we sub¬ 
sampled half the points. The remaining data were used 
as test sets. The uncertainty threshold was set to -9, and 
the minibatch size was 500. All other parameters were the 
same as in the earlier JIT learning experiment. 

Classification errors on the test sets and inference times are 
shown in Fig. 9a and Fig. 9b, respectively. The results 
demonstrate that KJIT improves the inference time on all 
the problems without sacrificing inference accuracy. The 
predictive variance of each outgoing message is shown in 
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Figure 9: Classification performance and inference times 
on the four UCI datasets. 


Fig. 8. An essential feature to notice is the rapid increase of 
the uncertainty after the first EP iteration of each problem. 
As shown in Fig. 1, the distributions of incoming messages 
of the four problems are diverse. The sharp rise followed 
by a steady decrease of the uncertainty is a good indica¬ 
tor that the operator is able to promptly detect a change in 
input message distribution, and robustly adapt to this new 
distribution by querying the oracle. 

5 CONCLUSIONS AND FUTURE WORK 

We have proposed a method for learning the mapping be¬ 
tween incoming and outgoing messages to a factor in ex¬ 
pectation propagation, which can be used in place of com¬ 
putationally demanding Monte Carlo estimates of these up¬ 
dates. Our operator has two main advantages: it can re¬ 
liably evaluate the uncertainty of its prediction, so that it 
only consults a more expensive oracle when it is uncertain, 
and it can efficiently update its mapping online, so that it 
learns from these additional consultations. Once trained, 
the learned mapping performs as well as the oracle map¬ 
ping, but at a far lower computational cost. This is in large 
part due to a novel two-stage random feature representation 
of the input messages. One topic of current research is hy¬ 
perparameter selection: at present, these are learned on an 
initial mini-batch of data, however a better option would be 
to adapt them online as more data are seen. 
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Kernel-Based Just-In-Time Learning for Passing Expectation Propagation Messages 

SUPPLEMENTARY MATERIAL 

A MEDIAN HEURISTIC FOR GAUSSIAN KERNEL ON MEAN EMBEDDINGS 

In the proposed KJIT, there are two kernels: the inner kernel k for computing mean embeddings, and the outer Gaussian 
kernel k defined on the mean embeddings. Both of the kernels depend on a number of parameters. In this section, 
we describe a heuristic to choose the kernel parameters. We emphasize that this heuristic is merely for computational 
convenience. A full parameter selection procedure like cross validation or evidence maximization will likely yield a better 
set of parameters. We use this heuristic in the initial mini-batch phase before the actual online learning. 

Let {rf ^ |; I = 1,..., c, and i = 1,... , n} be a set of n incoming message tuples collected during the mini-batch phase, 
from c variables neighboring the factor. Let It, := (r^)p =1 be the i th tuple, and let r i := xf =1 r^ be the product of 
incoming messages in the i th tuple. Define S) and s, to be the corresponding quantities of another tuple of messages. We 
will drop the subscript i when considering only one tuple. 

Recall that the kernel on two tuples of messages R and S is given by 

k(R, S ) = k( r, s) = exp 

= exp (- 7^2 + ^2 (Mr, Ms) - ^2 (Ms,Ms)J , 

where (fi n y s ) = E x ^ r E yr ^ s k(x—y). The inner kernel k is a Gaussian kernel defined on the domain X := X W x • • • x X(c) 
where X^> denotes the domain of r 1 !). For simplicity, we assume that X'- 1 ' 1 is one-dimensional. The Gaussian kernel k 
takes the form 

k{x-y) = exp (x - y) T X^ 1 (x - y)j = LI exp 

where X = diag(af,..., of). The heuristic for choosing of,..., <r 2 and 7 is as follows. 

L Set of := £ J27=i v Xi ^ r (0 M where ^ Xl ~r w \- Xl \ denotes the variance of rf \ 

2. With X = diag(of,..., of) as defined in the previous step, set 7 2 := median ({||/x r . — y s . || 2 }"j=i)- 

B KERNELS AND RANDOM FEATURES 

This section reviews relevant kernels and their random feature representations. 

B.l RANDOM FEATURES 

This section contains a summary of Rahimi and Recht (2007)’s random Fourier features for a translation invariant kernel. 

A kernel k(x, y) = {<j>(x), 4>(y)} in general may correspond to an inner product in an infinite-dimensional space whose 
feature map 0 cannot be explicitly computed. In Rahimi and Recht (2007), methods of computing an approximate feature 
maps (f> were proposed. The approximate feature maps are such that k{x, y) ~ 4>(x) T (j)(y) (with equality in expectation) 
where <j>(x), </>(y) £ M 15 and D is the number of random features. High D yields a better approximation with higher 
computational cost. Assume k(x, y) = k{x—y) (translation invariant) and x, y £ R d . Random Fourier features 4>(x) £ R D 
such that k(x, y) « (f>(x) T (f>(y) are generated as follows: 

1. Compute the Fourier transform k of the kernel k: k(uS) = f- / e~^ s k(S)dS. 

2. Draw D i.i.d. samples w 1 ,..., uid £ from k. 

3. Draw D i.i.d samples bi ,..., bn from U[0, 27t] (uniform distribution). 

4. <j>{x) = (cos (ujx + 61 ). ,cos (ujpX + br>)) T £ R D 




Why It Works 

Theorem 1. Bochner’s theorem (Rudin, 2013). A continuous kernel k(x, y) = k( x — y) on R m is positive definite iffk(S) 
is the Fourier transform of a non-negative measure. 

Furthermore, if a translation invariant kernel k(S) is properly scaled, Bochner’s theorem guarantees that its Fourier trans¬ 
form p(ui) is a probability distribution. From this fact, we have 

k(x — y) = J fc(w)e jajT(x-y) du = E w [r) u {x)ri u (y)*] , 

where j = a/—I, r^(.x') = e lujTx and •* denotes the complex conjugate. Since both k and k are real, the complex 
exponential contains only the cosine terms. Drawing D samples lowers the variance of the approximation. 

Theorem 2. Separation of variables. Let f be the Fourier transform of f. If f(x i,... ,Xd) = fi(xi) ■ ■ ■ fd(xd), then 

/(wi, ...,u> d ) = nti M^i). 

Theorem 2 suggests that the random Fourier features can be extended to a product kernel by drawing to independently for 
each kernel. 

B.2 MV (MEAN-VARIANCE) KERNEL 

Assume there are c incoming messages R := (r^)^ and S := ■ Assume that 

E r .( i) [x] = mi 
V r (o [x\ = vi 
[ y\ = Bi 
V s (d [y] = erf. 

Incoming messages are not necessarily Gaussian. The MV (mean-variance) kernel is defined as a product kernel on means 
and variances. 


Kmv (R, s) = n k ((mi - Pi) /w ™) k ((vi - erf) /w?) , 

i— 1 i=l 

where k is a Gaussian kernel with unit width. The kernel K mv has P ■.= (w ..., w™, w”,... ,w”) as its parameters. 
With this kernel, we treat messages as finite dimensional vectors. All incoming messages (sW )? =1 are represented as 
(pi, ..., p c , erf,..., erf ) T . This treatment reduces the problem of having distributions as inputs to the familiar problem of 
having input points from a Euclidean space. The random features of Rahimi and Recht (2007) can be applied straightfor¬ 
wardly. 

B.3 EXPECTED PRODUCT KERNEL 

Given two distributions r(x) = M(x\ m r , V r ) and s(y) = Af(y; m s , V s ) (d-dimensional Gaussian), the expected product 
kernel is defined as 

ttpro (G t>) — (pr : Ps) % — E r E s /c(x y) , 

where p r := E r k(x, •) is the mean embedding of r, and we assume that the kernel k associated with 7 ~L is translation 
invariant i.e., k(x. y) = k(x — y). The goal here is to derive random Fourier features for the expected product kernel. That 
is, we aim to find (j) such that ft pro (r, s ) « 4>(r) T 4>(s) and (f> £ M 15 . 

We first give some results which will be used to derive the Fourier features for inner product of mean embeddings. 
Lemma 3. Ifb ~ A f(b; 0, a 2 ), then E[cos(6)] = exp (— \o 2 \ 


Proof. We can see this by considering the characteristic function of x ~ A f(x\ p. a 2 ) which is given by 


For m = 0, t = 1, we have 


Cb( 1 ) = E b [exp (ib)] = exp 



E b [cos( 6 )], 


where the imaginary part i sin (tb) vanishes. 


□ 


From Rahimi and Recht (2007) which provides random features for k(x — y), we immediately have 

2 ° 

E r E s fc(a: — y) ~ E r E s — ^ cos (wj x + bi) cos (wj y + bi) 

^ i=l 

2 ° 

= — ^ E r(a .) cos (wjx + bi) E s(y) cos (wjy + bi), 

i =1 

where {Wi ~ fc(tu) (Fourier transform of k) and {bi}®. l ~ U [0, 27r]. 

Consider E r ( x ) cos (wjx + bi). Define Zi = to/"i + bi. So Zi ~ Af(zi] wj m r + b i: wj V r Wi). Let di ~ A/"(0, V r Wi). 
Then, r(di + m r + bj) = Af(wJ m r + bi, wj V r Wi ) which is the same distribution as that of Zi- From these definitions 
we have. 


E r(x) cos (wjx + bi ) = E r(z .) cos(zi) 

= E r ( d .) cos (di + wjm r + b^ 

== E r ( di ) cos (di) cos (wj m r + bi) — E,,( d .) sin(dj) sin(w7 m r + bi) 
= COs(w7 m r + ^i)E r ( ( i i ) cos(di) 

= COs(wJ ITlr + bi) exp -wj V r Wj\ , 


where at (a) we use cos(a + 0) = cos(a) cos(/3) — sin(a) sin(/3). We have ( 6 ) because sin is an odd function and 
E r ( d .) sin(dj) = 0. The last equality (c) follows from Lemma 3. It follows that the random features </>(r) € R 15 are given 


by 


U r ) 



cos(wJ m r + 6 i) exp (— ^wj V r w\) 
cos(u>Jm r + bo) exp (—^wJ,V r wjj) 


Notice that the translation invariant kernel k provides k from which { w ,}, are drawn. For a different type of distribution 
r, we only need to be able to compute E r(x ) cos (wjx + bi ). With <j>(r), we have K pro (r, s) ss i>(r) T i>(s) with equality in 
expectation. 


Analytic Expression for Gaussian Case For reference, if r, s are normal distributions and k is a Gaussian kernel, an 
analytic expression is available. Assume k(x — y) = exp ^ (x — y) T E _1 (x — y)^j where E is the kernel parameter. 
Then 


E r E s fc(x — y) = 


I det (D rs ) 
det(E- 1 ) 


exp 


(m r - m s ) T D r 


(m r — m s ) 


D rs (V/ + Vs + E) 


-l 


Approximation Quality The following result compares the randomly generated features to the true kernel matrix using 
various numbers of random features D. For each D. we repeat 10 trials, where the randomness in each trial arises from 
the construction of the random features. Samples are univariate normal distributions Af(rri. v) where m ~ 7V(0, 9) and 
v ~ Gamma(3,1/4) (shape-rate parameterization). The kernel parameter was E := a 2 1 where <j 2 = 3. 




n=1000, repeats=10 



“max entry diff" refers to the maximum entry-wise difference between the true kernel matrix and the approximated kernel 
matrix. 

B.4 PRODUCT KERNEL ON MEAN EMBEDDINGS 

Previously, we have defined an expected product kernel on single distributions. One way to define a kernel between two 
tuples of more than one incoming message is to take a product of the kernels defined on each message. 

Let fj, r m := E r ( 0 ( o )A;^(-, a) be the mean embedding (Smola et ah, 2007) of the distribution into RKHS induced 

by the kernel k. Assume = kg l J U ss (Gaussian kernel) and assume there are c incoming messages R := (rW(aW ))? =1 
and S := (s^ (b^))^ =1 . A product of expected product kernels is defined as 


fi-pro, prod(-R) R) • ( J 

\ 1=1 1=1 / 
c 

= nE r (z)( a (0)E s (z)( 6( z))^uss ( 'a^ l \b^ « 0(i^) T 0(5), 
1=1 


where 4>(R) T (f>{S) = J^_ 1 The feature map ca n be estimated by applying the random 

Fourier features to fcgauss and taking the expectations E r (i)/ 0 \E s (i)(j,\. The final feature map is (j>{R) = © 

</>( 2 )(r( 2 ))©- • -©0( c ) ( r ( c )) £ R- 0 , where © denotes a Rronecker product and we assume that ft 1 ' 1 £ for l £ c}. 


B.5 SUM KERNEL ON MEAN EMBEDDINGS 


If we instead define the kernel as the sum of c kernels, we have 


I (R: R) ^ 5 )'H (0 


1=1 


«££<V , > ) T£ { V , > ) 

i=i 


= 0(R) T 0(R), 


where (p(R) := ^>^(r^) T ,... £ M cD . 


B.6 NUMBER OF RANDOM FEATURES FOR GAUSSIAN KERNEL ON MEAN EMBEDDINGS 

We quantify the effect of D UI and D out empirically as follows. We generate 300 Gaussian messages, compute the true Gram 
matrix and the approximate Gram matrix given by the random features, and report the Frobenius norm of the difference 
of the two matrices on a grid of D m and D out . For each (l )\ n , -D ou t), we repeat 20 times with a different set of random 
features and report the averaged Frobenius norm. 
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The result suggests that D ou , has more effect in improving the approximation. 

C MORE DETAILS ON EXPERIMENT 1: BATCH LEARNING 

There are a number of kernels on distributions we may use for just-in-time learning. To find the most suitable kernel, 
we compare the performance of each on a collection of incoming and output messages at the logistic factor in the binary 
logistic regression problem i.e., same problem as in experiment 1 in the main text. All messages are collected by running 
EP 20 times on generated toy data. Only messages in the first five iterations are considered as messages passed in the early 
phase of EP vary more than in a near-convergence phase. The regression output to be learned is the numerator of (1). 

A training set of 5000 messages and a test set of 3000 messages are obtained by subsampling all the collected messages. 
Where random features are used, kernel widths and regularization parameters are chosen by leave-one-out cross validation. 
To get a good sense of the approximation error from the random features, we also compare with incomplete Cholesky 
factorization (denoted by IChol), a widely used Gram matrix approximation technique. We use hold-out cross validation 
with randomly chosen training and validation sets for parameter selection, and kernel ridge regression in its dual form 
when the incomplete Cholesky factorization is used. 

Let / be the logistic factor and mbe an outgoing message. Let be the ground truth belief message (numerator) 
associated with ro/_>.j. The error metric we use is KL[< 7 /_>.j || qf^-i] where <?/-u are the belief messages estimated by a 
learned regression function. The following table reports the mean of the log KL-divergence and standard deviations. 

mean log KL s.d. of log KL 


Random features + MV Kernel -6.96 1.67 

Random features + Expected product kernel on joint embeddings -2.78 1.82 

Random features + Sum of expected product kernels -1.05 1.93 

Random features + Product of expected product kernels -2.64 1.65 

Random features + Gaussian kernel on joint embeddings (KJIT) -8.97 1.57 

IChol + sum of Gaussian kernel on embeddings -2.75 2.84 

IChol + Gaussian kernel on joint embeddings -8.71 1.69 

Breiman’s random forests (Breiman, 2001) -8.69 1.79 

Extremely randomized trees (Geurts et ah, 2006) -8.90 1.59 

Eslami et al. (2014)’s random forests (Eslami et ah, 2014) -6.94 3.88 


n=300, repeats=50. Report Fro. norm(K - Khat) / n. 



outer features 


The MV kernel is defined in Section B.2. Here product (sum) of expected product kernels refers to a product (sum) of 
kernels, where each is an expected product kernel defined on one incoming message. Evidently, the Gaussian kernel on 
joint mean embeddings performs significantly better than other kernels. Besides the proposed method, we also compare 
the message prediction performance to Breiman’s random forests (Breiman, 2001), extremely randomized trees (Geurts 
et ah, 2006), and Eslami et ah (2014)’s random forests. We use scikit-learn toolbox for the extremely randomized trees and 
Breiman’s random forests. For Eslami et ah (2014)’s random forests, we reimplemented the method as closely as possible 
according to the description given in Eslami et ah (2014). In all cases, the number of trees is set to 64. Empirically we 
observe that decreasing the log KL error below -8 will not yield a noticeable performance gain in the actual EP. 














Figure 10: KL-divergence error versus predictive variance for predicting the mean of rri f ^ z . (normal distribution) in the 
logistic factor problem. 


To verify that the uncertainty estimates given by KJIT coincide with the actual predictive performance (i.e., accurate 
prediction when confident), we plot the predictive variance against the KL-divergence error on both the training and test 
sets. The results are shown in Fig. 10. The uncertainty estimates show a positive correlation with the KL-divergence errors. 
It is instructive to note that no point lies at the bottom right i.e., making a large error while being confident. The fact that 
the errors on the training set are roughly the same as the errors on the test set indicates that the operator does not overfit. 
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